Using LoRA to Fine Tune Gemma Models for Text to SQL
DSAN 5800 Final Project Report
1 Abstract
A core competency of recent Large Language Models is increasingly strong performance on structured translation tasks. The ability to convert natural-language text into computer-understandable formats, such as SQL, offers new opportunities to transform abstract data requests into precise, repeatable queries. Text-to-SQL involves translating written text, such as an English question, into SQL queries that can be executed against live databases. The task is challenging, particularly for low-parameter language models, because queries must be generated from variable database schemas, and SQL's strict syntax requirements make logical forms fragile to small perturbations. Fully training a large language model for these specific tasks is computationally expensive, and for any non-enterprise application it is practically infeasible; relying on commercial model APIs introduces its own cost and resource limitations. This project therefore investigates whether a parameter-efficient fine-tuning approach, Low-Rank Adaptation (LoRA), can specialize Google's Gemma-3 4B model for basic text-to-SQL tasks. The project constructs a supervised fine-tuning pipeline in which the pre-trained backbone remains frozen and quantized in four-bit NF4 precision, while a rank-32 LoRA adapter is applied to all major projection matrices. A large text-to-SQL corpus, WikiSQL, provides general SQL generation competency, while evaluation occurs on a held-out set of disjoint WikiSQL tables to measure generalization.
[PLACEHOLDER FOR BRIEF DESCRIPTION OF RESULTS] The evaluation pipeline executes predicted SQL against SQLite databases derived from table metadata and compares both execution results and a suite of similarity metrics. The trained model exhibits substantial improvements across SQL validity, execution accuracy, lexical similarity, and structural SQL alignment. These results demonstrate the effectiveness of LoRA-based adaptation for SQL generation in low-resource computational settings.
2 Introduction
Text-to-SQL generation aims to translate natural language questions into executable database queries. While large language models increasingly exhibit strong zero-shot and few-shot reasoning capabilities, they struggle with the precision required for SQL generation. A single misplaced identifier or incorrect operator renders a query invalid. Moreover, SQL is always grounded in a specific schema: the model must understand which columns exist, how they relate to the question, and how to assemble a logically consistent query.
Fine-tuning the full weight matrices of an LLM for these behaviors requires significant computational resources. Parameter-efficient methods instead introduce small trainable modules into a frozen backbone. LoRA is among the most successful of these methods, projecting updates into low-dimensional subspaces and thereby reducing the number of trainable parameters by orders of magnitude. This project applies LoRA to Google’s Gemma-3 4B pretrained model with the objective of adapting the model toward structured SQL generation behaviors without updating its billions of parameters.
The project unfolds across three primary phases. First, the WikiSQL training split is used for supervised fine-tuning with LoRA, enabling the model to learn generic SQL patterning. Second, a standardized evaluation on held-out WikiSQL tables is conducted using a custom SQLite execution pipeline. Third, model outputs are analyzed along semantic, syntactic, and execution dimensions. These phases jointly support the conclusion that LoRA-based fine-tuning can reliably adapt a mid-sized model to SQL generation on previously unseen schemas.
3 Literature Review
The design of this project draws from several key areas in the literature. Parameter-efficient fine-tuning methods, including LoRA, have shown that full-model updates are unnecessary for many downstream tasks. The LoRA formulation introduced by Hu et al. (2021) demonstrates that weight updates in large transformer models often lie near low-rank subspaces. A frozen weight matrix \(W_0\) is therefore augmented with a rank-\(r\) update expressed as
\[ W = W_0 + \Delta W \qquad \text{where} \qquad \Delta W = B A \]
where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) are trainable matrices with \(r \ll \min(d, k)\), so that \(\Delta W = B A\) has the same shape as \(W_0 \in \mathbb{R}^{d \times k}\). This formulation considerably reduces the number of trainable parameters while enabling expressive adaptation.
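A back-of-envelope count makes the reduction concrete. The sketch below assumes a single square projection matching Gemma-3's 2560-unit hidden size and the rank of 32 used in this project:

```python
# Parameter count for a dense update vs. a rank-r LoRA update
# of a single d x k projection matrix.
d, k, r = 2560, 2560, 32   # hidden size of Gemma-3 4B; LoRA rank used in this project

full_params = d * k          # a dense update touches every weight: 6,553,600
lora_params = d * r + r * k  # B (d x r) plus A (r x k): 163,840

reduction = lora_params / full_params  # 0.025 -> 2.5% of the dense update
```

At rank 32 the adapter for this one projection holds only 2.5% of the parameters a full update would require, and the savings compound across every adapted projection in every layer.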
Dettmers et al. (2023) introduced QLoRA, which combines LoRA updates with four-bit quantization of the frozen base model. QLoRA demonstrated that powerful models can be fine-tuned on single-GPU hardware, laying the foundation for the training strategy used in this project. By quantizing the Gemma-3 4B model into NF4 format and training only LoRA adapters, this project follows the practical recommendations of QLoRA to maximize efficiency and stability.
The foundational computational structures of transformer models, including self-attention, residual pathways, and feed-forward projections, follow directly from the course materials in DSAN 5800. Larson (2025a) reviews sequence modeling fundamentals and motivates the use of attention mechanisms. Larson (2025b) expands this foundation into full transformer architectures, describing multi-head self-attention, positional encoding, and the stacking of transformer blocks. These theoretical components directly inform the deeper architectural review provided in this report.
Finally, the WikiSQL dataset has played a central role in text-to-SQL research since its introduction by Zhong et al. (2017). It is notable for its constrained SQL grammar and the large variety of tables, which makes it suitable for evaluating generalization across schema-specific tasks. Existing research demonstrates that models must integrate schema information and natural language semantics to perform well on WikiSQL. This project adopts the standard execution-based evaluation criterion, aligning with prior work.
Together, these strands of research motivate the methodological design of this project: use LoRA to specialize a mid-sized LLM for SQL generation, leverage four-bit quantization to control memory costs, and evaluate rigorously using execution correctness.
4 Dataset
This project relies exclusively on the WikiSQL dataset for both supervised fine-tuning and evaluation. The approach uses only real WikiSQL examples so that the fine-tuned model learns directly from human-generated question–SQL pairs and table schemas. This constraint simplifies the experimental setting and ensures that evaluation reflects performance on the same structural distribution seen during training.
4.1 Dataset Characteristics
WikiSQL is well-suited for studying foundational text-to-SQL behavior for three primary reasons. First, it features a simple but still realistic SQL grammar, where each query operates over exactly one table, includes at most a single aggregation operator, and consists of a conjunction of comparatively shallow filtering conditions. This design isolates the core translation problem—mapping natural language onto SQL relational operators—while avoiding confounds such as multi-table joins, nested subqueries, or complex predicate structure. Second, WikiSQL provides unusually high schema and task diversity: it contains over 24,000 unique tables, each with different column names, column orders, and data distributions. Because every example is grounded in its own table, the model cannot rely on memorized column labels or SQL templates, and must instead learn generalizable alignment between linguistic references (e.g., “what is the population”) and arbitrary schema fields. This property makes WikiSQL a powerful testbed for evaluating schema-conditioned generalization in LLMs. Third, the dataset is mature, publicly validated, and widely adopted in the semantic parsing community. With more than 80,000 natural-language questions paired with structured SQL programs, WikiSQL offers a sufficiently large corpus for supervised fine-tuning, along with clear execution-based evaluation metrics that allow unambiguous correctness checking.
A critical aspect of WikiSQL, and the one that makes supervised fine-tuning possible, is that each example includes not only a question and table schema, but also a canonical SQL program that expresses the correct answer to each natural language question. This canonical program, often referred to as the gold query, serves as the ground-truth target during training and evaluation. The presence of these structured gold queries enables both supervised fine-tuning and execution-based evaluation, where the predicted query generated by an LLM and the gold query can be run against the same table to verify whether they return equivalent answers. The inclusion of gold queries is one of the primary reasons WikiSQL remains such a widely used benchmark.
In keeping with best practices for machine learning, the training, validation, and test sets are partitioned such that there is no overlap of tables between them and each unique table occurs in only one split. This ensures that model evaluation reflects genuine generalization to unseen schemas, rather than memorization of patterns or data from previously observed tables. A summary of the dataset sizes used in this project is shown below:
| Split | Rows | Used For |
|---|---|---|
| Train | 56,355 | Training set |
| Dev | 8,421 | Validation & monitoring |
| Test | 15,878 | Test Set |
These counts correspond to the examples that remained after filtering malformed SQL programs, invalid schemas, or tables that could not be serialized cleanly. Each surviving row is represented using four structured features:
- A question in ordinary English.
- The table schema, specifying all column names and data types.
- The target SQL query, represented as both a logical tuple and executable SQL text.
- Example table rows for reconstructing the corresponding SQLite table.
Example WikiSQL Data Item
Question: Tell me what the notes are for South Australia
Table Headers:
- State/territory
- Text/background colour
- Format
- Current slogan
- Current series
- Notes
Example Rows:
| State/territory | Text/background colour | Format | Current slogan | Current series | Notes |
|-----------------------------|-----------------------|-------------|--------------------------------------|---------------|---------------------------------------|
| Australian Capital Territory| blue/white | Yaa·nna | ACT · CELEBRATION OF A CENTURY 2013 | YIL·00A | Slogan screenprinted on plate |
| New South Wales | black/yellow | aa·nn·aa | NEW SOUTH WALES | BX·99·HI | No slogan on current series |
| New South Wales | black/white | aaa·nna | NSW | CPX·12A | Optional white slimline series |
| Northern Territory | ochre/white | Ca·nn·aa | NT · OUTBACK AUSTRALIA | CB·06·ZZ | New series began in June 2011 |
| Queensland | maroon/white | nnn·aaa | QUEENSLAND · SUNSHINE STATE | 999·TLG | Slogan embossed on plate |
| South Australia | black/white | Snnn·aaa | SOUTH AUSTRALIA | S000·AZD | No slogan on current series |
| Victoria | blue/white | aaa·nnn | VICTORIA - THE PLACE TO BE | ZZZ·562 | Current series will be exhausted this year |
Logical SQL:
sel = 5
agg = 0
conds = [[3, 0, "SOUTH AUSTRALIA"]]
Executable SQL:
SELECT "Notes"
FROM data
WHERE "Current slogan" = 'SOUTH AUSTRALIA'
4.2 Preprocessing and Prompt Construction
To convert WikiSQL examples into training-ready sequences, each example is reformatted into a prompt–completion pair using a strict, instruction-driven structure. This design aligns with the constraints of a causal decoder-only model such as Gemma, which generates output exclusively through next-token prediction. By placing the SQL query in the completion rather than embedding it within the prompt, the supervision signal directly mirrors the model’s inference-time behavior.
The prompt itself is organized into clearly delineated sections - <INSTRUCTIONS>, <SCHEMA>, <QUESTION>, and <SQL_Query> - each represented by explicit opening and closing tags that appear verbatim in the input text. These tags function as unambiguous structural boundaries, ensuring that the model can reliably distinguish between natural-language instructions, schema metadata, question content, and the region where SQL generation begins. Such explicit segmentation has been shown to stabilize text-to-SQL training by reducing format drift, preventing the model from mixing explanation and code, and framing SQL prediction as a discrete subtask rather than a free-form generation problem. The resulting prompt format is therefore both human-interpretable and optimized for predictable model behavior.
Importantly, in the training pipeline all row level table data and the logical SQL encodings (sel/agg/conds) are intentionally excluded from the LLM’s input. Only the schema representation and the natural language question are included in the prompt, and only the executable SQL string appears in the completion. This design mirrors practical deployment settings, where exposing full table contents would be infeasible due to size, privacy, or API constraints. It also forces the model to infer column relevance, aggregation type, and filtering conditions purely from the question–schema alignment rather than memorizing example values. The exclusion of table rows prevents shortcut learning, such as exploiting statistical correlations between specific cell values and question patterns. By contrast, the serialized schema gives the model exactly the minimal structural information required to produce syntactically valid and semantically grounded SQL queries. The strict tag structure and constrained input-output interface jointly act as a curriculum and, along with the consistent instructions field, they work to guide the model toward producing clean, well-formed SQL while minimizing opportunities for hallucination or format drift.
The schema is serialized into the following uniform text structure:
PROMPT FIELD:
<INSTRUCTIONS>
You are a precise text-to-SQL generator. Using the known schema of the sql
database you must output only a valid SQL query and nothing else.
</INSTRUCTIONS>
<SCHEMA>
- State/territory (TEXT)
- Text/background colour (TEXT)
- Format (TEXT)
- Current slogan (TEXT)
- Current series (TEXT)
- Notes (TEXT)
</SCHEMA>
<QUESTION>
Tell me what the notes are for South Australia
</QUESTION>
<SQL_Query>
COMPLETION FIELD:
SELECT "Notes" FROM data
WHERE "Current slogan" = 'SOUTH AUSTRALIA'
</SQL_Query>
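The prompt field above can be assembled programmatically. The following is a minimal sketch; the `build_prompt` name and the TEXT-only column typing are illustrative assumptions, not the project's exact code:

```python
def build_prompt(headers, question):
    """Serialize a table schema and question into the tagged prompt format."""
    # Each column is listed on its own line inside the <SCHEMA> block.
    schema_lines = "\n".join(f"- {h} (TEXT)" for h in headers)
    return (
        "<INSTRUCTIONS>\n"
        "You are a precise text-to-SQL generator. Using the known schema of the sql\n"
        "database you must output only a valid SQL query and nothing else.\n"
        "</INSTRUCTIONS>\n"
        "<SCHEMA>\n"
        f"{schema_lines}\n"
        "</SCHEMA>\n"
        "<QUESTION>\n"
        f"{question}\n"
        "</QUESTION>\n"
        "<SQL_Query>\n"   # generation begins immediately after this tag
    )

prompt = build_prompt(
    ["State/territory", "Current slogan", "Notes"],
    "Tell me what the notes are for South Australia",
)
```

Because the prompt ends immediately after the opening `<SQL_Query>` tag, the model's continuation is, by construction, the SQL completion.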
4.3 SQLite Reconstruction and Executable SQL
For evaluation, each WikiSQL table is materialized as a lightweight, in-memory SQLite database so that both gold and model-generated queries can be executed under a consistent, realistic SQL semantics. SQLite is a natural choice in this setting because it supports the core subset of SQL operators used in WikiSQL (selection, aggregation, and simple conjunctions), imposes strict syntactic requirements that expose malformed queries, and can be constructed directly from the table headers and rows provided in the dataset. Concretely, each table object is first converted into a Pandas DataFrame using the original column headers and row values.
This DataFrame is then written into an ephemeral SQLite database under a fixed table name (data) as shown:
import sqlite3
# df is the DataFrame representation of the WikiSQL table
conn = sqlite3.connect(":memory:") # create an in-memory SQLite database
df.to_sql("data", conn, index=False, if_exists="replace")

The logical SQL encoding provided by WikiSQL (sel, agg, conds) is mapped into an executable SQL string using the logical_to_sql() helper, which expands the selected column index, aggregation operator, and condition tuples into a syntactically valid SELECT ... FROM data WHERE ... statement.
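A hedged sketch of what the logical_to_sql() conversion can look like, using WikiSQL's published operator vocabularies (the `AGG_OPS` and `COND_OPS` lists follow the dataset's own convention; quoting of string versus numeric values is simplified here):

```python
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]  # WikiSQL aggregation codes
COND_OPS = ["=", ">", "<", "OP"]                     # WikiSQL condition operators

def logical_to_sql(headers, sel, agg, conds, table="data"):
    """Expand WikiSQL's (sel, agg, conds) encoding into an executable SQL string."""
    col = f'"{headers[sel]}"'
    select = f"{AGG_OPS[agg]}({col})" if agg else col   # agg code 0 means no aggregation
    sql = f"SELECT {select} FROM {table}"
    if conds:
        clauses = []
        for col_idx, op_idx, value in conds:
            # Quote string values; leave numeric values bare (simplified rule).
            val = value if isinstance(value, (int, float)) else f"'{value}'"
            clauses.append(f'"{headers[col_idx]}" {COND_OPS[op_idx]} {val}')
        sql += " WHERE " + " AND ".join(clauses)
    return sql
```

Applied to the South Australia example above (sel = 5, agg = 0, conds = [[3, 0, "SOUTH AUSTRALIA"]]), this reproduces the executable SQL shown in Section 4.1.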
At evaluation time, both the gold SQL and the model’s predicted SQL are executed against the same in-memory database using a shared execute_sql() wrapper:
def execute_sql(conn, sql):
    try:
        cursor = conn.execute(sql)
        return cursor.fetchall()
    except Exception as e:
        return f"Error: {e}"

This design serves multiple purposes: it enforces that the model produce not just syntactically valid SQL but also semantically executable queries; it decouples the modeling code from any particular storage backend; and it enables execution-based metrics (e.g., whether the predicted query returns exactly the same result set as the gold query) that directly reflect end-to-end task success.
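The execution-based comparison described above can be sketched end to end as follows. The three-way outcome labels and the sorted-rows equality rule are illustrative choices, not the project's exact code:

```python
import sqlite3

import pandas as pd

def execution_match(df, gold_sql, pred_sql):
    """Run gold and predicted SQL against the same in-memory table and compare rows."""
    conn = sqlite3.connect(":memory:")
    df.to_sql("data", conn, index=False, if_exists="replace")
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
    except Exception:
        return "invalid"          # syntax or runtime error in the prediction
    gold_rows = conn.execute(gold_sql).fetchall()
    # Sort so that row ordering does not affect the equality check.
    return "correct" if sorted(pred_rows) == sorted(gold_rows) else "wrong_result"
```

Aggregating these three labels over the test sample yields the execution accuracy, wrong-result rate, and invalid-SQL rate reported later.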
5 Methods
5.1 Gemma-3 Model Architecture
Gemma-3 4B is a mid-sized transformer-based language model designed by Google with efficiency, adaptability, and fine-tuning stability as primary objectives. At approximately four billion parameters, it occupies a practical middle ground: large enough to support high-quality text generation and structured reasoning tasks, yet compact enough to be fine-tuned on a single modern GPU using parameter-efficient methods such as LoRA. The model follows the general decoder-only transformer paradigm common to contemporary large language models, but introduces several refinements—including hybrid attention patterns, large vocabulary capacity, and optimized projection structures—to improve performance on long-context, instruction-following, and code-generation tasks.
At a high level, Gemma-3 4B consists of 34 transformer blocks, each operating over a hidden dimensionality of 2560 units. The architecture uses rotary positional embeddings to encode word order and maintain relative positional invariances, which improves generalization across sequence lengths. Attention is implemented via multi-head self-attention with 8 heads, and select layers employ sliding-window attention to reduce computational cost on long sequences. These local attention blocks are interspersed with globally attentive layers to periodically refresh holistic contextual representations, a strategy that allows Gemma-3 to scale to long contexts without incurring the quadratic cost of full attention at every layer.
A summary of the most relevant architectural hyperparameters is presented below:
| Feature | Value / Description |
|---|---|
| Total Trainable Parameters | 1,342,504,960 (~23%) |
| Total Non-trainable Parameters | 5,708,161,392 |
| Number of Layers | 34 Transformer blocks |
| Hidden Size | 2560 |
| Attention Heads | 8 (4 key/value heads) |
| MLP Intermediate Size | 10,240 |
| Activation Function | GELU-Tanh |
| Vocabulary Size | 262,208 tokens |
| Positional Encoding | RoPE with linear scaling (×8), effective context window: 131k tokens |
| Attention Type | Sliding-window (1024 tokens) + global full attention every 6 blocks |
Each transformer block in Gemma-3 follows the established pattern of attention → feed-forward → residual connection, but with careful engineering to improve numerical stability and training efficiency:
Multi-Head Self-Attention (MHSA).
The model computes learned projections for queries, keys, and values using matrices \(W_Q\), \(W_K\), and \(W_V\). These are grouped into multiple heads, which allows the model to attend to different types of dependencies in parallel. Attention is computed via the scaled dot-product formulation:\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V \]
The outputs of all heads are concatenated and passed through an output projection \(W_O\) before being added to the residual pathway.
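The scaled dot-product step can be isolated in a few lines of NumPy (a single head, with no masking or batching, for illustration only):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ V                             # convex mix of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query positions, head dimension 8
K = rng.normal(size=(6, 8))   # 6 key/value positions
V = rng.normal(size=(6, 8))
out = attention(Q, K, V)      # shape (4, 8)
```

Causal decoding adds a triangular mask to `scores` so each position attends only to earlier tokens; sliding-window layers further restrict the mask to a local band.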
Feed-Forward Network (FFN).
Following attention, each block applies a gated feed-forward MLP of the form\[ \mathrm{FFN}(x) = W_{\mathrm{down}}\, \big( f(W_{\mathrm{gate}}\, x) \odot (W_{\mathrm{up}}\, x) \big) \]
where \(f\) is a smooth nonlinearity (GELU with a tanh approximation in Gemma-3, per the table above) and \(\odot\) denotes elementwise multiplication. The gate and up projections expand the dimensionality, allowing for richer intermediate representations, while the down projection compresses the result back to the hidden dimension.
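Because the LoRA targets described later include gate, up, and down projections, Gemma's feed-forward sublayer is in practice a gated variant of this MLP. A NumPy sketch of that gated pattern, assuming the tanh-approximate GELU listed in the table above (toy dimensions, not Gemma's real ones):

```python
import numpy as np

def gelu_tanh(x):
    """Tanh approximation of GELU, the activation listed for Gemma-3."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gated_ffn(x, W_gate, W_up, W_down):
    """Down-project the elementwise product of a nonlinear gate branch
    and a linear up branch."""
    return (gelu_tanh(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d, d_ff = 16, 64   # toy sizes; Gemma-3 4B uses 2560 and 10240
x = rng.normal(size=(3, d))
W_gate = rng.normal(size=(d, d_ff))
W_up = rng.normal(size=(d, d_ff))
W_down = rng.normal(size=(d_ff, d))
y = gated_ffn(x, W_gate, W_up, W_down)   # shape (3, 16)
```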
Normalization and Residual Structure.
Each sublayer includes normalization layers to regulate activation magnitudes and improve gradient flow. The residual pathways serve as shortcuts that mitigate vanishing-gradient issues and stabilize both pretraining and fine-tuning.
5.2 Low-Rank Adaptation (LoRA) Configuration
LoRA is a parameter-efficient adaptation method that tunes large language models by inserting small, trainable matrices—called adapters—into selected linear projections within a frozen base transformer. For a given base weight matrix \(W_0\), LoRA augments the weights with a low-rank update as follows:
\[ W = W_0 + \frac{\alpha}{r} A B \]
where \(A\) and \(B\) are trainable matrices of size \((d, r)\) and \((r, k)\), respectively (\(r\) denoting the low-rank dimension), and \(\alpha\) is a scaling factor to stabilize and scale the update. Only the small \(A\) and \(B\) matrices are modified during training; the full weight matrix \(W_0\) remains fixed. This setup dramatically reduces the number of tunable parameters, as the count now grows linearly with \(r\) instead of with the size of \(W_0\).
In our configuration, the LoRA adapters use rank \(r = 32\) and scaling factor \(\alpha = 16\), providing a strong balance between flexibility and stability on moderate hardware. Adapters are inserted into the most important linear projections: the query, key, value, and output weights in attention, as well as the gate, up, and down projections in the feed-forward sublayers. This placement enables the model to efficiently adapt its mechanisms for parsing schemas and generating SQL from natural language.
Importantly, LoRA provides memory efficiency not just during training, but also when saving and deploying models. Because only the adapter (LoRA) weights are updated, these small matrices—rather than the entire set of base model weights—can be saved after fine-tuning. This means that storage and sharing of task-specific models becomes highly efficient: just the adapter file needs to be retained, and later it can be “attached” or merged with any compatible base model for easy reuse or deployment, without duplicating the large pretrained model files.
Our broader training approach follows the QLoRA framework: the base Gemma weights are held fixed and stored in quantized (NF4) precision for substantial GPU memory savings, while the smaller LoRA adapters are kept at higher precision for effective optimization. This decoupling lets us train and deploy strong models even on limited hardware, and further maximizes storage and sharing efficiency by siloing the small, learnable adapter parameters from the large, frozen base model.
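In the Hugging Face stack, this quantization setup is typically expressed through a `BitsAndBytesConfig`. The compute dtype and double-quantization flags below are common QLoRA defaults, assumed here rather than taken from the project code:

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as recommended by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute dtype (assumed)
    bnb_4bit_use_double_quant=True,         # also quantize quantization constants (assumed)
)
# Passed to AutoModelForCausalLM.from_pretrained(..., quantization_config=bnb_config)
```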
The Colab training environment configures LoRA using the following settings:
peft_config = LoraConfig(
r=32,
lora_alpha=16,
lora_dropout=0.05,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
bias="none",
task_type="CAUSAL_LM",
modules_to_save=["lm_head", "embed_tokens"]
)

This configuration directs LoRA to focus its adaptation capacity on the projections most critical for SQL reasoning. It also preserves the token embeddings and final language-modeling head in full precision, which helps maintain stable output distributions during training. The result is an efficient and effective fine-tuning mechanism that improves task-specific behavior without modifying the billions of pretrained parameters in the Gemma model.
5.3 Training Plan
The goal of the training pipeline is to adapt the Gemma 3 model, with roughly four billion parameters, to the text-to-SQL task using LoRA adapters while keeping the pretrained backbone frozen and quantized. The method follows a causal language modeling objective, which means that the model is trained to predict the next token in a sequence that contains both the prompt and the gold SQL completion. Because SQL prediction in deployment is also a continuation-based process, the training objective and inference behavior remain aligned, reducing the risk of format drift and encouraging the model to internalize SQL generation as a structured next-token prediction problem. Only the LoRA parameters are updated during training while the transformer backbone stays fixed in 4-bit NF4 precision. This separation provides a stable optimization surface and keeps the computational and memory demands well within the limits of a single GPU environment.
The WikiSQL dataset, once preprocessed, is fed into the model as a series of prompt–completion pairs. Each prompt contains an instruction block, a serialized table schema, and the user question, all wrapped in explicit XML-like tags that allow the tokenizer and model to reliably partition the semantic components of each example. The completion field then contains the corresponding SQL query ending with a closing tag. These examples are loaded into Hugging Face Datasets objects, which provide efficient memory mapping and automatic integration with the tokenization pipeline. During training, the SFTTrainer prepares each example by concatenating the prompt and SQL completion into a single sequence that reflects the exact text the model will see during inference. The tokenizer then applies Gemma’s byte-pair encoding rules to convert text into token IDs, and the trainer assembles these token sequences into batches. Because packing=True is enabled, multiple short examples are combined into the same 512-token window, reducing padding waste and ensuring that GPU memory is used efficiently.
Once tokenized, each sequence is copied into a parallel label sequence that is shifted by one position. This means that for token position t, the model attempts to predict token t+1. The resulting cross-entropy loss is computed only over the completion region of each sequence, so the model is not penalized for regenerating instructions or schema information. This selective masking ensures that the optimization pressure focuses on SQL formation rather than reproducing prompt structure. The causal objective therefore becomes a direct mechanism for teaching the model how to form valid SQL: each incorrect token prediction increases the loss, and the LoRA adapters update in a direction that improves the syntactic and semantic correctness of future predictions.
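The completion-only masking can be sketched in plain Python. The ignore index -100 is the standard PyTorch cross-entropy convention; the token IDs below are dummies, and the one-position shift described above is applied inside the model's loss computation:

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss

def build_labels(prompt_ids, completion_ids):
    """Mask the prompt region so loss is computed only over the SQL completion."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Dummy token IDs standing in for a tokenized prompt and SQL completion.
input_ids, labels = build_labels([5, 9, 2], [17, 31, 8, 1])
# input_ids -> [5, 9, 2, 17, 31, 8, 1]
# labels    -> [-100, -100, -100, 17, 31, 8, 1]
```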
The mechanics of loss reduction follow the standard structure of a decoder-only transformer. For every token position, Gemma produces a probability distribution over the entire vocabulary. The training signal compares this distribution to the true next token through a cross-entropy objective. When the predicted distribution assigns too little mass to the correct SQL token, the associated gradient flows backward through the LoRA adapter matrices and updates them to improve future predictions. The frozen 4-bit backbone never changes, but it continues to supply rich contextual and semantic embeddings. The LoRA matrices essentially “bend” these pretrained features toward SQL behavior without disturbing the foundational linguistic knowledge encoded in Gemma. Gradient accumulation is employed to expand the effective batch size: although the nominal batch size per device is one, gradients from four successive forward–backward passes are aggregated before the optimizer performs a weight update. This method stabilizes training while avoiding memory overflows.
The training configuration defined through SFTConfig governs all aspects of optimization and resource management. The 512-token sequence length reflects the short nature of WikiSQL examples and avoids unnecessary cost. Packed sequences improve throughput, which is particularly important for quantized models whose performance often depends on efficient batching. Three epochs provide adequate exposure to the training distribution without encouraging memorization of table-specific patterns, especially because WikiSQL partitions tables across splits to prevent leakage. The learning rate is intentionally conservative, following QLoRA recommendations, because adapter tuning on a quantized backbone can become unstable at higher rates. A warmup period ensures that the optimizer does not immediately apply large updates, which helps prevent divergence in the early stages of training. Mixed precision, either FP16 or BF16 depending on hardware capabilities, reduces memory usage and speeds up computation. The use of gradient checkpointing further lowers memory pressure by recomputing intermediate activations during the backward pass. The constant learning rate scheduler maintains a steady optimization environment once warmup is complete. Each of these settings contributes to an efficient and stable fine-tuning procedure suitable for limited hardware.
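These choices translate roughly into an `SFTConfig` along the following lines. Field names vary across TRL releases, and the specific learning rate, warmup ratio, logging interval, and output path are assumptions rather than values taken from the project code:

```python
from trl import SFTConfig

sft_config = SFTConfig(
    max_seq_length=512,             # short WikiSQL examples fit comfortably
    packing=True,                   # combine short examples into one window
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # effective batch size of 4
    learning_rate=2e-4,             # conservative rate per QLoRA practice (assumed)
    warmup_ratio=0.03,              # warmup fraction (assumed)
    lr_scheduler_type="constant",   # steady rate once warmup completes
    gradient_checkpointing=True,    # recompute activations to save memory
    bf16=True,                      # or fp16, depending on hardware
    logging_steps=25,               # logging interval (assumed)
    output_dir="gemma3-sql-lora",   # illustrative path
)
```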
During training, the SFTTrainer logs several metrics that provide insight into learning dynamics. The primary signal is the training loss, which shows how effectively the model is predicting SQL tokens as training progresses. Additional logs include the active learning rate, the number of tokens processed per second, and internal trainer statistics.
The training pipeline relies on an integrated stack of Python libraries. Hugging Face Transformers provides the Gemma architecture along with quantization hooks that allow NF4 loading through BitsAndBytes. The TRL library supplies the SFTTrainer, which automates dataset packing, label masking, forward passes, and loss computation for causal LM objectives. The PEFT library manages LoRA initialization, weight injection, and adapter merging. Hugging Face Datasets provides the backbone for efficient preprocessing and dataset streaming. SQLite and Pandas support evaluation, enabling real SQL execution through a lightweight, in-memory database. TensorBoard records scalar metrics that allow visual inspection of the loss curves throughout training. This entire stack is orchestrated inside a Google Colab environment, with persistent storage managed through Google Drive and experiment metadata captured through timestamped directories.
From end to end, training proceeds through a clear sequence of steps. WikiSQL examples are loaded, validated, and converted into prompt–completion structures. These examples are tokenized, packed, and fed into the model during supervised tuning. The Gemma backbone remains fixed while LoRA adapters gradually learn task-specific behavior. The trainer monitors loss and writes checkpoints at the end of each epoch. After fine-tuning is complete, the adapter weights and tokenizer are saved, and the model is evaluated through SQL execution on reconstructed SQLite tables.
5.4 Text to SQL Experimental Setup
The evaluation procedure is designed to compare two models under identical conditions. The first model is the unmodified Gemma 3 4B pretrained checkpoint, which serves as the baseline. The second is the fine-tuned LoRA-augmented Gemma 3 4B model produced by the training pipeline described earlier. Both models are evaluated using the same prompt generation process, the same SQL cleaning and extraction logic, and the same SQLite execution environment. This design ensures that measured performance differences derive from learned SQL generation behavior rather than from changes in preprocessing or evaluation tooling.
Evaluation begins by converting each WikiSQL test example into the same structured prompt format used during supervised training. The prompt consists of an instruction block, a serialized representation of the table schema, and the natural-language question. These components are wrapped in explicit XML-like tags, which allow the model to identify the point where SQL generation begins. The model receives no information about table rows during inference; it must infer the correct aggregation, selection, and filtering operations entirely from the schema and question. The prompt is then fed into the model, which generates a continuation consisting of a predicted SQL query. The query is extracted and cleaned through a standardized parsing step that identifies the first valid SQL-like region inside the <SQL_Query> tags of the generated text.
The predicted SQL is evaluated in several ways. First, the query is executed against a reconstructed SQLite database created from the table rows provided in the WikiSQL test set. Because each table in WikiSQL includes fully specified column names and row data, it is straightforward to materialize a corresponding SQLite table inside an in-memory database and then execute both the gold query and the predicted query within that environment. Execution results are compared directly. If the predicted SQL runs without a syntax error and returns the same result rows as the gold query, it is marked as correct. If the query is executable but returns an incorrect result set, it is counted as a wrong-result case. If the SQL engine raises a syntax or runtime error, the prediction is categorized as invalid. This execution-based evaluation is the most stringent test available because it verifies both syntactic correctness and semantic alignment with the true relational operation described in the question.
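The execution check described above can be sketched with the standard-library `sqlite3` module. The table name `data`, the TEXT-only column typing, and the three outcome labels are assumptions consistent with the example later in this report, not a copy of the project's code.

```python
import sqlite3

def execute_and_compare(columns, rows, gold_sql, pred_sql):
    """Materialize a WikiSQL table in an in-memory SQLite database and
    classify the prediction as 'correct', 'wrong_result', or 'invalid'."""
    conn = sqlite3.connect(":memory:")
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    conn.execute(f"CREATE TABLE data ({col_defs})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO data VALUES ({placeholders})", rows)
    gold = conn.execute(gold_sql).fetchall()
    try:
        pred = conn.execute(pred_sql).fetchall()
    except sqlite3.Error:
        return "invalid"      # syntax or runtime error from the SQL engine
    return "correct" if pred == gold else "wrong_result"

# Toy rows standing in for a WikiSQL table.
columns = ["Title", "Writer"]
rows = [("Sister Hood", "Dominic Minghella"), ("For England...!", "Rob Heyland")]
gold = 'SELECT "Title" FROM data WHERE "Writer" = \'Rob Heyland\''

result_ok = execute_and_compare(columns, rows, gold, gold)
result_bad = execute_and_compare(columns, rows, gold,
                                 "SELECT Title FROM Total WHERE Writer = 'Rob Heyland'")
```

The second call fails with a "no such table" error and is classified as invalid, mirroring the baseline failure shown in the example prompt comparison later in this report.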
Beyond execution correctness, the system performs a suite of similarity measurements. These include Jaccard token similarity, normalized Levenshtein similarity, and structural alignment metrics that compare selected columns, referenced columns in the WHERE clause, and operator usage. These structural scores are important because SQL queries can differ superficially while still expressing the same computation, and in other cases, they can appear similar while differing in a subtle but meaningful way. Text-based metrics such as BLEU and ROUGE are computed as well, along with exact-match comparisons and token-level F1 scores. The combination of execution and similarity metrics provides a multidimensional picture of the model’s predictive behavior and helps identify whether errors arise from structural misunderstandings, lexical drift, or outright semantic failures.
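The token-level similarity metrics can be sketched as follows. These are textbook definitions (Jaccard over token sets, Levenshtein distance normalized by the longer string, set-overlap token F1); the project's exact tokenization and normalization choices are not specified in the report and may differ.

```python
def jaccard(pred, gold):
    """Jaccard similarity over lowercase whitespace-token sets."""
    a, b = set(pred.lower().split()), set(gold.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def levenshtein_sim(a, b):
    """Normalized Levenshtein similarity: 1 - edit_distance / max_length."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b), 1)

def token_f1(pred, gold):
    """Harmonic mean of token-set precision and recall."""
    a, b = pred.lower().split(), gold.lower().split()
    common = len(set(a) & set(b))
    if common == 0:
        return 0.0
    p, r = common / len(a), common / len(b)
    return 2 * p * r / (p + r)
```

Structural metrics (selected columns, WHERE-clause columns, operators) additionally require parsing the SQL, which is why they catch queries that look alike lexically but compute different things.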
Inference with Gemma 3 4B is not instantaneous. Each query requires prompt construction, tokenization, autoregressive generation, SQL cleaning, schema materialization, and SQLite execution. The average end-to-end time for a single evaluation example is approximately fifteen seconds in the Colab environment used for this project. Running the full WikiSQL test set, which contains over fifteen thousand examples, would therefore require several hours of continuous inference, far exceeding the time available under practical constraints. To address this constraint while preserving statistical integrity, the evaluation operates on a random ten percent sample of the full test set. A reproducible random seed selects this subsample, allowing the experiment to run within an acceptable time window while maintaining representative coverage of the underlying data distribution. Each sampled example goes through the full evaluation sequence, and the aggregated results constitute the reported performance metrics for both the baseline and trained models.
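The seeded subsampling step can be sketched in a few lines. The seed value 42 here is an assumption for illustration; the report states only that a fixed seed is used.

```python
import random

def sample_subset(examples, fraction=0.10, seed=42):
    """Draw a reproducible random subsample of the test set.
    The seed value is an assumption; the report only says a fixed seed is used."""
    rng = random.Random(seed)          # local RNG so global state is untouched
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)

test_set = list(range(15000))  # stand-in for the ~15k WikiSQL test examples
subset = sample_subset(test_set)
```

Using a dedicated `random.Random` instance rather than the module-level RNG keeps the sample identical across runs even if other code consumes random numbers.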
The final evaluation step involves computing summary statistics over all predicted queries. These summaries include the proportion of syntactically valid SQL queries, the proportion that execute correctly, averaged similarity metrics, and aggregate BLEU, ROUGE, EM, and F1 values. Row-level logs are saved in JSONL format, preserving each question, predicted SQL, gold SQL, execution result, and error type, while a separate summary file records the global metrics that form the basis of the quantitative results in the report. All evaluations are performed twice, once for the baseline model and once for the LoRA-fine-tuned model, enabling a direct and controlled comparison of the effect of LoRA adaptation on SQL generation behavior.
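The logging and summary step can be sketched as below. The record field names (`pred_sql`, `exec_match`, `error`, and so on) are illustrative assumptions; the report specifies the information logged but not the exact keys.

```python
import io
import json

def write_logs(records, row_file, summary_file):
    """Write one JSON object per evaluated example (JSONL), then a summary
    of aggregate metrics over all records."""
    for rec in records:
        row_file.write(json.dumps(rec) + "\n")
    n = len(records)
    summary = {
        "valid_sql_rate": sum(r["error"] is None for r in records) / n,
        "execution_accuracy": sum(r["exec_match"] for r in records) / n,
    }
    json.dump(summary, summary_file)
    return summary

# Two toy rows: one correct execution, one invalid query.
records = [
    {"question": "q1", "pred_sql": "...", "gold_sql": "...",
     "exec_match": True, "error": None},
    {"question": "q2", "pred_sql": "...", "gold_sql": "...",
     "exec_match": False, "error": "no such table"},
]
row_buf, sum_buf = io.StringIO(), io.StringIO()
summary = write_logs(records, row_buf, sum_buf)
```

Keeping row-level JSONL alongside the summary is what makes the later error analysis possible: each invalid or wrong-result case can be traced back to its question and predicted SQL.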
6 Results
6.1 Training Results
Fig. 1 plots the mean token-level accuracy during training for the LoRA-fine-tuned Gemma model. In the early training steps, token accuracy rises rapidly and stabilizes near [ENTER IF THERE IS A VALUE], indicating that the model converges quickly and adapts effectively to SQL generation. Occasional sharp drops appear, likely corresponding to more challenging training batches, but accuracy recovers quickly each time, suggesting that the optimization process remains stable throughout training. Because this metric is computed on training data, it primarily reflects learning dynamics rather than generalization performance.
Fig. 2 shows the gradient norm over training steps for the LoRA-fine-tuned Gemma model. Early in training, the gradient norm is high and fluctuates significantly, reflecting large parameter updates as the adapters begin to learn. Over time, the gradient norm decreases and stabilizes, consistent with convergence. Occasional spikes appear on more challenging training batches, but training remains stable throughout and shows no sign of gradient explosion.
Fig. 3 shows the training loss over training steps for the LoRA-fine-tuned Gemma model. The loss decreases sharply in the early steps, showing that the model learns quickly from the data. As training continues, the loss declines more slowly and stabilizes at a low value, indicating convergence. Despite noise in the raw loss, the curve exhibits a consistent downward trend, confirming that training is stable and effective.
6.2 Text to SQL Evaluation Results
6.2.1 Baseline
6.2.2 Trained LoRA Model (Rank 32, Alpha 16)
6.2.3 Example Prompt Comparison
Table Preview (first rows)
| Total# | Series# | Title | Writer | Director | Original air date |
|---|---|---|---|---|---|
| 14 | 1 | " Sister Hood " | Dominic Minghella | Ciaran Donnelly | 6 October 2007, 7.30–8.15pm |
| 15 | 2 | " The Booby and the Beast " | Simon Ashford | Ciaran Donnelly | 13 October 2007, 7.30–8.15pm |
| 16 | 3 | " Childhood " | Jason Sutton | Ciaran Donnelly | 20 October 2007, 7.15–8.00pm |
| 17 | 4 | " The Angel of Death " | Julian Unthank | Matthew Evans | 27 October 2007, 7.15–8.00pm |
| 18 | 5 | " Ducking and Diving " | Debbie Oates | Matthew Evans | 3 November 2007, 7.15–8.00pm |
Schema Used
- Total# (TEXT)
- Series# (TEXT)
- Title (TEXT)
- Writer (TEXT)
- Director (TEXT)
- Original air date (TEXT)
Question:
What’s the title of the episode that Rob Heyland wrote?
Gold SQL Query
SELECT "Title" FROM data WHERE "Writer" = 'Rob Heyland'

Baseline Model Generated SQL

SELECT Title FROM Total WHERE Writer = 'Rob Heyland'

Trained Model Generated SQL

SELECT "Title" FROM data WHERE "Writer" = 'Rob Heyland'

Similarity Metrics
| Metric | Baseline Value | Trained LoRA Value |
|---|---|---|
| Exact Match | 0 | 1 |
| Jaccard | 0.5 | 1 |
| Levenshtein | 0.873 | 1 |
| Token F1 | 0.667 | 1 |
Execution Summary
| Item | Value |
|---|---|
| Baseline Result | ERROR: no such table: Total |
| Trained LoRA Result | [('" For England…! "',)] |
| Gold Result | [('" For England…! "',)] |